Impact of home, socioeconomic and demographic factors on student grades
                     

People are constantly affected by their environment and by the variables with which they surround themselves, and individuals in the early stage of life, say under 25, are even more susceptible to it. The same applies to students. Every student’s life is shaped by different features: a low-income family, family traditions, their parents’ education, community involvement, or race. Other factors matter too, such as the social group they spend their free time with, the support they get from their parents, and the activities they take part in. This environment affects their behavior and ease of life, but does it have any impact on their grades? This is an important question, and to be effective, teachers need to understand how students are influenced by these demographic characteristics. To analyze the impact of these attributes on grades, I will use the student achievement dataset of two Portuguese schools. The dataset contains the grades scored by 395 students in the last three periods and records environmental factors such as age, sex, school name, parents’ education status, and internet access. There are 33 variables in total, but I will not consider all of them, because what I want to analyze is whether the parents’ education background, family support, internet availability, and other related factors affect student grades.

# Load the required libraries.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(readr)
library(ggcorrplot)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

There are two datasets available for analysis: one contains the grades for mathematics and the other contains the grades for the Portuguese language. I will use the mathematics grades for my analysis.

# Load the dataset.
Math_perf <- read_csv("student-mat.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   age = col_double(),
##   Medu = col_double(),
##   Fedu = col_double(),
##   traveltime = col_double(),
##   studytime = col_double(),
##   failures = col_double(),
##   famrel = col_double(),
##   freetime = col_double(),
##   goout = col_double(),
##   Dalc = col_double(),
##   Walc = col_double(),
##   health = col_double(),
##   absences = col_double(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
head(Math_perf) # To see the first 6 rows.
## # A tibble: 6 x 33
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob    Fjob   reason  
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>   <chr>  <chr>   
## 1 GP     F        18 U       GT3     A           4     4 at_home teach… course  
## 2 GP     F        17 U       GT3     T           1     1 at_home other  course  
## 3 GP     F        15 U       LE3     T           1     1 at_home other  other   
## 4 GP     F        15 U       GT3     T           4     2 health  servi… home    
## 5 GP     F        16 U       GT3     T           3     3 other   other  home    
## 6 GP     M        16 U       LE3     T           4     3 servic… other  reputat…
## # … with 22 more variables: guardian <chr>, traveltime <dbl>, studytime <dbl>,
## #   failures <dbl>, schoolsup <chr>, famsup <chr>, paid <chr>,
## #   activities <chr>, nursery <chr>, higher <chr>, internet <chr>,
## #   romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## #   Walc <dbl>, health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>
str(Math_perf) # Structure of the dataset
## spec_tbl_df[,33] [395 × 33] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:395] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:395] "F" "F" "F" "F" ...
##  $ age       : num [1:395] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr [1:395] "U" "U" "U" "U" ...
##  $ famsize   : chr [1:395] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr [1:395] "A" "T" "T" "T" ...
##  $ Medu      : num [1:395] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num [1:395] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr [1:395] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr [1:395] "teacher" "other" "other" "services" ...
##  $ reason    : chr [1:395] "course" "course" "other" "home" ...
##  $ guardian  : chr [1:395] "mother" "father" "mother" "mother" ...
##  $ traveltime: num [1:395] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num [1:395] 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : num [1:395] 0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr [1:395] "yes" "no" "yes" "no" ...
##  $ famsup    : chr [1:395] "no" "yes" "no" "yes" ...
##  $ paid      : chr [1:395] "no" "no" "yes" "yes" ...
##  $ activities: chr [1:395] "no" "no" "no" "yes" ...
##  $ nursery   : chr [1:395] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:395] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:395] "no" "yes" "yes" "yes" ...
##  $ romantic  : chr [1:395] "no" "no" "no" "yes" ...
##  $ famrel    : num [1:395] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num [1:395] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num [1:395] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num [1:395] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num [1:395] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num [1:395] 3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : num [1:395] 6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : num [1:395] 5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : num [1:395] 6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : num [1:395] 6 6 10 15 10 15 11 6 19 15 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   failures = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   paid = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double()
##   .. )
# Convert all character variables to factors.
Math_perf[sapply(Math_perf, is.character)] <- lapply(Math_perf[sapply(Math_perf, 
                                                                         is.character)], as.factor)

#Summary of the dataset.
summary(Math_perf)
##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason      guardian  
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
##  Median :2.000   other   :141   other   :217   other     : 36   other : 32  
##  Mean   :2.522   services:103   services:111   reputation:105               
##  3rd Qu.:3.000   teacher : 58   teacher : 29                                
##  Max.   :4.000                                                              
##    traveltime      studytime        failures      schoolsup famsup     paid    
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344   no :153   no :214  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:242   yes:181  
##  Median :1.000   Median :2.000   Median :0.0000                                
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                                
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                                
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                                
##  activities nursery   higher    internet  romantic      famrel     
##  no :194    no : 81   no : 20   no : 66   no :263   Min.   :1.000  
##  yes:201    yes:314   yes:375   yes:329   yes:132   1st Qu.:4.000  
##                                                     Median :4.000  
##                                                     Mean   :3.944  
##                                                     3rd Qu.:5.000  
##                                                     Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00
# Check for NA values.
anyNA(Math_perf) # There are no NA values in the dataset.
## [1] FALSE

Summary of the Dataset

The dataset has 33 variables. I am not going to use all of them in my analysis, so I will explain only the variables that are part of my research questions.

I will use both ggplot2 and plotly for visualization, and tidyverse (including dplyr) for data manipulation.

Research Question 1: Do the school and the gender of a student have any relation to their grades?

attach(Math_perf)

plot_ly(Math_perf, x = ~school, y = ~G1, type = 'box', color = ~sex) %>%
  add_trace(y = ~G2, x = ~school, color = ~sex) %>%
  add_trace(y = ~G3, x = ~school, color = ~sex) %>%
  layout(boxmode = "group",
         title = "Grades of students in the last 3 exams, by school and sex",
         yaxis = list(title = "Marks scored in 3 exams"),
         xaxis = list(title = "Sex and school of student"))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

Answer: The first six boxplots refer to the GP school and the last six to MS. Within GP, the first two boxplots are for female and male students in the first-period exam, the next two are for the second period, and the last two are for the third period; MS follows the same order. I use this grouped plot as a bivariate analysis so that a single figure shows how school and gender relate to grades. The results of the two schools are almost the same in the first exam; GP then improves over the next two periods, while MS shows no change. In GP, male students perform better than female students in all three exams, while in MS female students perform better than male students. From this plot there is some evidence that grades depend on school and gender: overall GP performs better than MS, but the girls of GP score lower than the girls of MS.

Research Question 2: Does the area where a student lives affect their marks?

plot_ly(Math_perf, x = ~address, y = ~G1, type = 'violin', color = I('red')) %>%
  add_trace(y = ~G2, x = ~address, color = I('pink')) %>%
  add_trace(y = ~G3, x = ~address, color = I('green')) %>%
  layout(violinmode = "group",
         title = "Grades of students in the last 3 exams, by address",
         yaxis = list(title = "Marks scored in 3 exams"),
         xaxis = list(title = "Type of address: urban (U) or rural (R)"))
## Warning: 'layout' objects don't have these attributes: 'violinmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

Answer: This is a violin plot: it narrows where few students have a given grade and widens where many do. In both areas the plot is widest between 10 and 12, which means most students score in this range. There is not much difference between the two areas, so the plot gives no indication that grades depend on the type of area.

For further analysis I will do some feature engineering on the dataset. Instead of comparing the grades of each exam, I will build two columns. The first, total, is the average of the grades in all three exams. The second, perf, sorts that average into three categories: Poor for students whose average grade over the three exams is less than 9, Average for grades from 9 to 15, and Good for grades above 15.

# Build a column total for the average grade, rounded to zero decimal places so it is a whole number.
Math_perf$total <- round((Math_perf$G1 + Math_perf$G2 + Math_perf$G3) / 3)

# Build the perf column.
Math_perf$perf <- 0 # placeholder; every row is overwritten below
Math_perf$perf[Math_perf$total < 9] <- "Poor"
Math_perf$perf[Math_perf$total > 8 & Math_perf$total < 16] <- "Average"
Math_perf$perf[Math_perf$total > 15] <- "Good"
attach(Math_perf)
## The following objects are masked from Math_perf (pos = 3):
## 
##     absences, activities, address, age, Dalc, failures, famrel,
##     famsize, famsup, Fedu, Fjob, freetime, G1, G2, G3, goout, guardian,
##     health, higher, internet, Medu, Mjob, nursery, paid, Pstatus,
##     reason, romantic, school, schoolsup, sex, studytime, traveltime,
##     Walc
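
The binning rule used for perf can also be expressed in one step with base R's cut(). The following is a minimal sketch on a few made-up grade triples, not the code used for the analysis; the vectors g1, g2, g3 and the helper name grade_band are hypothetical.

```r
# Hypothetical grades for four students in the three periods (made-up values).
g1 <- c(5, 9, 15, 16)
g2 <- c(6, 9, 14, 17)
g3 <- c(6, 10, 15, 16)

# Hypothetical helper: average the three grades, round, then bin.
grade_band <- function(g1, g2, g3) {
  total <- round((g1 + g2 + g3) / 3)
  # Intervals are (-Inf, 8], (8, 15], (15, Inf):
  # Poor below 9, Average from 9 to 15, Good above 15.
  cut(total, breaks = c(-Inf, 8, 15, Inf),
      labels = c("Poor", "Average", "Good"))
}

bands <- grade_band(g1, g2, g3)
print(data.frame(g1, g2, g3, band = bands))
```

Because cut() returns a factor with the levels in the order Poor, Average, Good, tables and plots built from it would list the categories in that order rather than alphabetically.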

Research Question 3: Relation of absences and internet availability to student performance.

table(perf)
## perf
## Average    Good    Poor 
##     258      34     103
# Summarize the mean absences of students by performance group.
Math_perf %>% group_by(perf) %>% summarise(round(mean(absences),2))
## # A tibble: 3 x 2
##   perf    `round(mean(absences), 2)`
##   <chr>                        <dbl>
## 1 Average                       6.01
## 2 Good                          4.65
## 3 Poor                          5.31
#Proportion table for performance and internet.
round(prop.table(table(internet,perf),1),2)
##         perf
## internet Average Good Poor
##      no     0.62 0.06 0.32
##      yes    0.66 0.09 0.25
### chi Square Test

# H0 = internet availability is independent of performance.
# H1 = internet availability is dependent on performance.

chisq.test(table(internet,perf))
## 
##  Pearson's Chi-squared test
## 
## data:  table(internet, perf)
## X-squared = 1.7231, df = 2, p-value = 0.4225
# The p-value is 0.4225 > 0.05, so I fail to reject the null hypothesis.
# That means there is no evidence that internet availability and performance are dependent.

Answer: First I summarize absences by performance group and take the mean. I use the mean because the group sizes differ: as seen earlier, the Average group is much larger than the Good and Poor groups, so a sum of absences would be highest for the Average group simply because of its size and would not give a clear picture. From the summarized data, poor-performing students average about 5 days of absence, the Average group has the highest mean (about 6 days), and the Good group has the lowest (about 4.7 days). So the best-performing students are absent the least, although the pattern is not strictly monotonic, because the Poor group is absent slightly less than the Average group. For internet availability I use a proportion table, again to remove the effect of group size. I treat the students with internet access as one group and those without as a second group, take each group as 100 percent, and look at what share of it falls into Good, Average, and Poor performance. From the table I cannot see much difference between the two groups because the values are almost the same, so to get a clearer idea I use a chi-square test with a significance level of 5 percent. The null hypothesis is that internet availability is independent of performance, and the alternative hypothesis is that they are dependent. The p-value is greater than 0.05, so I fail to reject the null hypothesis: the data give no evidence that performance depends on internet access. One possible explanation is that most students were not using the internet for study purposes.
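
As a worked illustration of the proportion table and the chi-square test, the sketch below rebuilds the internet by performance table and repeats both steps. The cell counts are not printed in the output above; they are reconstructed here from the reported row proportions and the group totals (66 students without and 329 with internet; 258 Average, 34 Good, 103 Poor), so treat them as an assumption.

```r
# Counts reconstructed from the reported proportions and totals
# (an assumption, not copied from the original output).
internet_perf <- matrix(
  c( 41,  4, 21,
    217, 30, 82),
  nrow = 2, byrow = TRUE,
  dimnames = list(internet = c("no", "yes"),
                  perf = c("Average", "Good", "Poor"))
)

# Row proportions: each internet group is scaled to sum to 1.
row_props <- round(prop.table(internet_perf, margin = 1), 2)
print(row_props)

# Chi-square test of independence, plus the expected counts under H0.
test <- chisq.test(internet_perf)
print(test$expected)
print(test)

# Decision at the 5 percent significance level.
alpha <- 0.05
if (test$p.value > alpha) {
  message("Fail to reject H0: no evidence of dependence.")
} else {
  message("Reject H0: evidence of dependence.")
}
```

Comparing the observed counts with test$expected shows how far each cell is from what independence would predict, which is the quantity the test statistic sums up.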

Research Question 4: Relation between failures and family relation.

table(famrel,perf)
##       perf
## famrel Average Good Poor
##      1       5    1    2
##      2      11    1    6
##      3      43    4   21
##      4     130   16   49
##      5      69   12   25
### chi Square Test

# H0 = Family relationship is independent of student performance.
# H1 = Family relationship is dependent on student performance.

chisq.test(table(famrel,perf))
## Warning in chisq.test(table(famrel, perf)): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  table(famrel, perf)
## X-squared = 3.3133, df = 8, p-value = 0.9132
# The p-value is 0.9132 > 0.05, so I fail to reject the null hypothesis.

table(Pstatus,failures)
##        failures
## Pstatus   0   1   2   3
##       A  33   4   2   2
##       T 279  46  15  14
ggplot(Math_perf, aes(x = as.factor(failures), fill = as.factor(Pstatus))) + 
  geom_bar(position = "dodge") + ggtitle("Comparison of parent status by number of failures")

Answer: To analyze this I consider two factors, the parents’ status and the family relationship, because I want to see whether students who face issues in their family perform differently. First I compare the family relationship variable with performance. Because the counts vary so much between categories, I cannot draw a conclusion from the table alone, since the proportions all look similar. I therefore run a chi-square test again to check whether the two variables are dependent. The p-value is 0.91, far above 0.05, so there is no evidence that family relationship and performance are associated. R also warns that the chi-squared approximation may be incorrect, which usually happens when some expected cell counts are small, as they are for the lowest family relationship ratings, so this result should be read with caution. Next I look at the parents’ cohabitation status, to see whether parents living apart is linked to a drop in their children’s performance. For this I use a grouped bar plot split by the number of past failures. Whether a student failed three times or never failed, the proportions of parents living together and apart look similar, so I cannot say that the relationship between the parents, or between parents and child, affects performance.
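
The warning R prints for this test usually points to small expected counts. One way to check, sketched below under that assumption, is to inspect the expected counts and rerun the test with a Monte Carlo p-value (the simulate.p.value argument of chisq.test). The counts are copied from the famrel by perf table above; the seed and the number of replicates B are arbitrary choices.

```r
# Counts copied from the table(famrel, perf) output above.
famrel_perf <- matrix(
  c(  5,  1,  2,
     11,  1,  6,
     43,  4, 21,
    130, 16, 49,
     69, 12, 25),
  nrow = 5, byrow = TRUE,
  dimnames = list(famrel = as.character(1:5),
                  perf = c("Average", "Good", "Poor"))
)

# Asymptotic test; the warning is suppressed because it is examined below.
asymptotic <- suppressWarnings(chisq.test(famrel_perf))

# Cells with an expected count below 5 are the usual cause of the warning.
small_cells <- sum(asymptotic$expected < 5)
cat("Cells with expected count < 5:", small_cells, "\n")

# Monte Carlo p-value, which does not rely on the chi-squared approximation.
set.seed(1)  # arbitrary seed, only for reproducibility
simulated <- chisq.test(famrel_perf, simulate.p.value = TRUE, B = 5000)
print(simulated)
```

If the simulated p-value stays far above 0.05, the conclusion of no detectable association does not hinge on the approximation.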

Research Question 5: Impact of study time and of educational support from school and family on student performance.

ggplot(Math_perf, aes(y = studytime,x  = perf,fill = perf)) +
  geom_boxplot() + ggtitle("Comparison of performance based on studytime ")

Answer: To understand the impact, I plot study time against performance using a boxplot. The middle line of each box shows the median, not the mean, and the median is the same for all performance groups. Note that studytime is an ordinal code from 1 to 4 rather than a number of hours. The box for the Good group extends further into the category for 5 to 10 hours of study than the boxes for the Average and Poor groups, which suggests there are some cases where students who spend more time studying score higher marks. The box for the Poor group lies below the median line, meaning many of these students spend very little time studying. So there is some difference in individual performance based on study time, but for the population as a whole the typical study time is the same.
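
Because the middle line of a boxplot is the median and not the mean, the two can differ for skewed data. A minimal sketch with made-up study time codes (the vector study_codes is hypothetical, not taken from the dataset):

```r
# Made-up ordinal study time codes (1 to 4) for eight students.
study_codes <- c(1, 1, 2, 2, 2, 2, 3, 4)

# boxplot.stats() returns the five numbers a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
five_numbers <- boxplot.stats(study_codes)$stats
box_middle <- five_numbers[3]

cat("Middle line of the box:", box_middle, "\n")
cat("Median:", median(study_codes), "\n")
cat("Mean:", mean(study_codes), "\n")
```

For an ordinal code like this, the median is also the safer summary, since the distance between codes does not correspond to equal amounts of time.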

plot_ly(Math_perf, y = ~schoolsup, x = ~perf, type = 'violin', color = 'schoolsup',
        side = 'positive', orientation = 'h') %>%
  add_trace(y = ~famsup, color = 'famsup') %>%
  layout(violinmode = "group")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'violinmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'
round(prop.table(table(schoolsup,perf),1),2)
##          perf
## schoolsup Average Good Poor
##       no     0.65 0.10 0.25
##       yes    0.67 0.02 0.31
round(prop.table(table(famsup,perf),1),2)
##       perf
## famsup Average Good Poor
##    no     0.67 0.09 0.24
##    yes    0.64 0.08 0.27

Support from family and school is generally thought to help students succeed, because students feel motivated when the people around them support them. Does this motivation actually improve grades? To answer this I analyze the relationship between these factors and performance. I use a violin plot to see how students are concentrated by performance; wherever the area gets wider, there are more students. Comparing the two kinds of support, students who get educational support from their family perform better than students who get support from the school: the Poor area is narrower for family support than for school support, and in the proportion tables only 2 percent of students with school support are in the Good group compared with 8 percent of those with family support. Within each factor, however, the proportions for students with and without support are close, so the evidence that either kind of support changes grades is weak.

Research Question 6: Impact of parents’ job and education background on student grades.

Math_perf$Medu[Math_perf$Medu == 0] <- 'None'
Math_perf$Medu[Math_perf$Medu == 1] <- '4th Grade'
Math_perf$Medu[Math_perf$Medu == 2] <- '5th to 9th Grade' 
Math_perf$Medu[Math_perf$Medu == 3] <- 'Secondary Education'
Math_perf$Medu[Math_perf$Medu == 4] <- 'Higher Education'


plot_ly(Math_perf, y = ~total, x = ~Mjob, type = "box", color = I('red')) %>%
  add_trace(x = ~Medu, color = I("pink")) %>%
  layout(boxmode = "group",
         title = "Impact of mother's job and education background on student grade",
         yaxis = list(title = "Average grade in the last three exams"),
         xaxis = list(title = "Job and education background category"))
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'
Math_perf$Fedu[Math_perf$Fedu == 0] <- 'None'
Math_perf$Fedu[Math_perf$Fedu == 1] <- '4th Grade'
Math_perf$Fedu[Math_perf$Fedu == 2] <- '5th to 9th Grade' 
Math_perf$Fedu[Math_perf$Fedu == 3] <- 'Secondary Education'
Math_perf$Fedu[Math_perf$Fedu == 4] <- 'Higher Education'

plot_ly(Math_perf, y = ~total, x = ~Fjob, type = "box", color = I('blue')) %>%
  add_trace(x = ~Fedu, color = I("orange")) %>%
  layout(boxmode = "group",
         title = "Impact of father's job and education background on student grade",
         yaxis = list(title = "Average grade in the last three exams"),
         xaxis = list(title = "Job and education background category"))
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## 'font', 'title', 'uniformtext', 'autosize', 'width', 'height', 'margin', 'computed', 'paper_bgcolor', 'plot_bgcolor', 'separators', 'hidesources', 'showlegend', 'colorway', 'datarevision', 'uirevision', 'editrevision', 'selectionrevision', 'template', 'modebar', 'newshape', 'activeshape', 'meta', 'transition', '_deprecated', 'clickmode', 'dragmode', 'hovermode', 'hoverdistance', 'spikedistance', 'hoverlabel', 'selectdirection', 'grid', 'calendar', 'xaxis', 'yaxis', 'ternary', 'scene', 'geo', 'mapbox', 'polar', 'radialaxis', 'angularaxis', 'direction', 'orientation', 'editType', 'legend', 'annotations', 'shapes', 'images', 'updatemenus', 'sliders', 'colorscale', 'coloraxis', 'metasrc', 'barmode', 'bargap', 'mapType'

Answer: It is no secret that parents are the primary influence in their children’s lives, guiding what they eat, where they live, and even what they wear. But parents influence their children in a far more important way: according to research published by Lamar University, Texas, USA, parents’ education level has a significant impact on their children’s success. To see the impact on grades, I consider both education and job background as factors. Before the analysis I recode the education columns from numeric codes to descriptive labels for easier reading. For mothers, the grades of students whose mothers have a higher degree are noticeably better than those of students whose mothers have a low education background. The trend is upward: grades increase as the mother’s education level increases, being lowest when the mother only completed 4th grade and highest when she attended higher education. As for jobs, grades are highest when the mother works in the health domain, while students whose mothers are homemakers or work in the other category have average grades. I see the same trend for the father’s education: students whose fathers attended higher education perform far better than others, so the education background of both parents has a positive association with the child’s performance. For the father’s job, students whose fathers are teachers perform better than other students, and unlike the pattern for mothers, students whose fathers stay at home or work in the health field perform about the same. So both the father’s and the mother’s job appear to be related to student performance.
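
The recoding of the education codes into labels can also be done in a single step with factor(), which additionally keeps the education levels in their natural order. This is a sketch on a made-up vector of codes, assuming the same five labels used in the recoding code; edu_codes and edu_labels are hypothetical names.

```r
# Hypothetical education codes, using the 0 to 4 coding of Medu and Fedu.
edu_codes <- c(4, 1, 1, 3, 0, 2)

# Labels in the order of the codes 0, 1, 2, 3, 4.
edu_labels <- c("None", "4th Grade", "5th to 9th Grade",
                "Secondary Education", "Higher Education")

# factor() maps each code to its label and fixes the level order.
edu <- factor(edu_codes, levels = 0:4, labels = edu_labels)

print(edu)
print(table(edu))
```

Applying this to the real columns would change the category order in the plots and the reference level in a regression, so results would not match the output shown here line for line.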

Modeling: After analyzing the different factors I can move on to modeling. I will use regression, because it relates the target variable to the independent variables and gives a best-fit equation when the target is well correlated with them. Before diving into that, I will build a correlation matrix to see how strongly these attributes are correlated.

# Compute a correlation matrix
corr <- round(cor(Math_perf[,c(3,13:15,24:33)]), 1)
head(corr)
##            age traveltime studytime failures famrel freetime goout Dalc Walc
## age        1.0        0.1       0.0      0.2    0.1      0.0   0.1  0.1  0.1
## traveltime 0.1        1.0      -0.1      0.1    0.0      0.0   0.0  0.1  0.1
## studytime  0.0       -0.1       1.0     -0.2    0.0     -0.1  -0.1 -0.2 -0.3
## failures   0.2        0.1      -0.2      1.0    0.0      0.1   0.1  0.1  0.1
## famrel     0.1        0.0       0.0      0.0    1.0      0.2   0.1 -0.1 -0.1
## freetime   0.0        0.0      -0.1      0.1    0.2      1.0   0.3  0.2  0.1
##            health absences   G1   G2   G3
## age          -0.1      0.2 -0.1 -0.1 -0.2
## traveltime    0.0      0.0 -0.1 -0.2 -0.1
## studytime    -0.1     -0.1  0.2  0.1  0.1
## failures      0.1      0.1 -0.4 -0.4 -0.4
## famrel        0.1      0.0  0.0  0.0  0.1
## freetime      0.1     -0.1  0.0  0.0  0.0
ggcorrplot(corr, hc.order = TRUE, type = "lower",lab = TRUE)

I can see here that the grades are strongly correlated with each other, which is expected: a student who performs well in the first-period exam will most likely score close to that in the next two periods. Apart from this, the grades are negatively correlated with failures (about -0.4), which means that students with more past failures tend to have lower grades. No other variable shows a notable correlation with the grades.
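To read such a matrix quickly, one option is to pull out the column for the final grade and sort it by absolute value. The sketch below does this on simulated data (the variable names mirror the dataset, but the numbers are invented for illustration and are not the study's results).

```r
# Simulated stand-in for the numeric columns; the real analysis uses Math_perf.
set.seed(1)
n <- 200
failures <- rpois(n, 0.4)
G1 <- round(11 - 1.5 * failures + rnorm(n, sd = 3))
G2 <- round(G1 + rnorm(n, sd = 1.5))
G3 <- round(G2 + rnorm(n, sd = 1.5))
freetime <- sample(1:5, n, replace = TRUE)
toy <- data.frame(failures, freetime, G1, G2, G3)

# Same rounding as in the correlation matrix above.
corr <- round(cor(toy), 1)

# Correlation of every other variable with G3, strongest first.
g3_corr <- corr[rownames(corr) != "G3", "G3"]
g3_corr <- g3_corr[order(abs(g3_corr), decreasing = TRUE)]
print(g3_corr)
```

Sorting by absolute value keeps strong negative relationships, such as the one with failures, near the top of the list.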

#Regression model 1
model_1 <- lm(G3 ~ G1 + G2 + failures + absences + Fedu + Fjob + Medu + Mjob + famsup + school + 
                sex + studytime,data = Math_perf)

summary(model_1)
## 
## Call:
## lm(formula = G3 ~ G1 + G2 + failures + absences + Fedu + Fjob + 
##     Medu + Mjob + famsup + school + sex + studytime, data = Math_perf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8256 -0.5103  0.3402  0.9884  4.0110 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.54693    0.72094  -2.146  0.03255 *  
## G1                       0.16507    0.05875   2.810  0.00522 ** 
## G2                       0.96811    0.05079  19.060  < 2e-16 ***
## failures                -0.34784    0.15014  -2.317  0.02106 *  
## absences                 0.03621    0.01249   2.900  0.00396 ** 
## Fedu5th to 9th Grade    -0.72250    0.30085  -2.401  0.01682 *  
## FeduHigher Education    -0.74628    0.39723  -1.879  0.06107 .  
## FeduNone                -0.29380    1.39693  -0.210  0.83353    
## FeduSecondary Education -0.32047    0.33950  -0.944  0.34581    
## Fjobhealth               0.86901    0.65675   1.323  0.18659    
## Fjobother                0.23124    0.46368   0.499  0.61827    
## Fjobservices            -0.07653    0.48221  -0.159  0.87398    
## Fjobteacher              0.14467    0.59892   0.242  0.80926    
## Medu5th to 9th Grade    -0.17331    0.35021  -0.495  0.62099    
## MeduHigher Education     0.27655    0.46908   0.590  0.55584    
## MeduNone                 1.16569    1.16003   1.005  0.31561    
## MeduSecondary Education  0.20529    0.38767   0.530  0.59674    
## Mjobhealth              -0.17421    0.49791  -0.350  0.72663    
## Mjobother                0.05798    0.31855   0.182  0.85568    
## Mjobservices             0.19482    0.35424   0.550  0.58267    
## Mjobteacher             -0.12530    0.46665  -0.269  0.78845    
## famsupyes                0.22945    0.21259   1.079  0.28114    
## schoolMS                 0.14087    0.31709   0.444  0.65712    
## sexM                     0.20646    0.21473   0.961  0.33694    
## studytime               -0.18788    0.12980  -1.447  0.14861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.912 on 370 degrees of freedom
## Multiple R-squared:  0.8365, Adjusted R-squared:  0.8259 
## F-statistic: 78.85 on 24 and 370 DF,  p-value: < 2.2e-16
model_1 %>%  plot(which = 2)

#Regression model 2
model_2 <- lm(G3 ~ G1 + G2 +log(failures+1) + absences,  data = Math_perf)

summary(model_2)
## 
## Call:
## lm(formula = G3 ~ G1 + G2 + log(failures + 1) + absences, data = Math_perf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3488 -0.3397  0.3027  0.9360  3.7427 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.67185    0.37747  -4.429 1.23e-05 ***
## G1                 0.14075    0.05558   2.532  0.01172 *  
## G2                 0.97540    0.04908  19.873  < 2e-16 ***
## log(failures + 1) -0.63012    0.26400  -2.387  0.01747 *  
## absences           0.03885    0.01205   3.224  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.906 on 390 degrees of freedom
## Multiple R-squared:  0.8287, Adjusted R-squared:  0.8269 
## F-statistic: 471.7 on 4 and 390 DF,  p-value: < 2.2e-16
model_2 %>%  plot(which = 2)

#Regression Model 3
model_3 <- lm( total ~ failures +log(absences + 1) + sex +  studytime  ,data = Math_perf)

summary(model_3)
## 
## Call:
## lm(formula = total ~ failures + log(absences + 1) + sex + studytime, 
##     data = Math_perf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8398 -2.2777  0.0328  2.4050  9.0153 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.1076     0.6150  14.809  < 2e-16 ***
## failures           -1.7918     0.2340  -7.658  1.5e-13 ***
## log(absences + 1)   0.3486     0.1623   2.147  0.03239 *  
## sexM                1.1854     0.3601   3.292  0.00108 ** 
## studytime           0.5468     0.2177   2.512  0.01242 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.4 on 390 degrees of freedom
## Multiple R-squared:  0.173,  Adjusted R-squared:  0.1645 
## F-statistic: 20.39 on 4 and 390 DF,  p-value: 2.858e-15
model_3 %>% plot(which = 2)

#Result Discussion: I first built a model with all the attributes I explored in the EDA together with those that correlate well with the dependent variable. In this first model, however, most of the p-values for the education, job, school, and sex terms are above 0.05. The reason may be that only one or two levels of these factors have any relationship with the target variable; I saw during the EDA, for example, that grades were almost the same whether the father worked in health or stayed at home. In the next model I therefore removed these attributes and kept only the grades of the first two periods, failures, and absences. Now all my p-values are less than 0.05, and the R-squared value is 0.83, which is quite good.
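A side-by-side comparison of a larger and a smaller model can be sketched like this. The data are simulated and the column names `noise_a` and `noise_b` are hypothetical placeholders for predictors that carry no signal; adjusted R-squared and AIC can be compared for any two models of the same response, whereas the partial F-test from `anova()` assumes the smaller model is nested in the larger one.

```r
# Simulated data: G3 depends on G2 and failures; the noise columns do not matter.
set.seed(42)
n <- 300
toy <- data.frame(
  G2       = round(runif(n, 4, 18)),
  failures = rpois(n, 0.3),
  noise_a  = rnorm(n),
  noise_b  = factor(sample(c("x", "y", "z"), n, replace = TRUE))
)
toy$G3 <- toy$G2 - 0.5 * toy$failures + rnorm(n, sd = 1.5)

full    <- lm(G3 ~ G2 + failures + noise_a + noise_b, data = toy)
reduced <- lm(G3 ~ G2 + failures, data = toy)

# Adjusted R-squared and AIC for both models in one table.
comparison <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  aic    = c(AIC(full), AIC(reduced))
)
print(comparison)

# Partial F-test: do the extra terms in the full model improve the fit?
print(anova(reduced, full))
```

If the extra terms add little, the adjusted R-squared of the two models stays close, which is the pattern seen between the first and second models here (0.8259 versus 0.8269).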

Limitation: Although the R-squared value is good, is this model actually useful? My task was to analyze the impact of social and other factors on student grades, but the high R-squared comes from the grades being correlated with each other. To analyze that impact I built another model with total as the target variable and without the period grades; the independent variables are only failures, study time, absences, and sex. The p-values are significant for all of these variables, but the R-squared (0.17) and adjusted R-squared (0.16) are low. These variables do affect the target variable, but they explain only a small share of its variation: failures, absences, and study time are related to performance, but the relationship is weak.
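The point that predictors can be statistically significant while R-squared stays low is easier to see from the definition of R-squared, one minus the ratio of residual to total variation. The sketch below computes it by hand on simulated data (invented numbers, not the study's) and checks it against `summary()`.

```r
# Simulated data with weak true effects and a lot of unexplained variation.
set.seed(7)
n <- 250
toy <- data.frame(
  failures  = rpois(n, 0.4),
  studytime = sample(1:4, n, replace = TRUE)
)
toy$total <- 10 - 1.5 * toy$failures + 0.5 * toy$studytime + rnorm(n, sd = 3.5)

fit <- lm(total ~ failures + studytime, data = toy)

rss <- sum(residuals(fit)^2)                  # variation left unexplained
tss <- sum((toy$total - mean(toy$total))^2)   # total variation in the response
r2_manual <- 1 - rss / tss

print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
```

A small p-value says a coefficient is unlikely to be zero; R-squared says how much of the spread in grades the model accounts for, and the two need not move together.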

                                  # Conclusion #

After analyzing the different attributes present in the dataset I can say that socioeconomic factors play a significant role in the achievement of high grades in the mathematics examination. The results of my study suggest that socioeconomic background is a major predictor of a student’s grade, whether it is the parents’ education, their jobs, or their support for the student’s education. Parent involvement is also effective in predicting math achievement, but the demographic factor doesn’t add much to student performance, because I saw that the average performance is almost the same whether the student lives in a rural or an urban area. The results of my study also suggest that economic variables and attitudinal factors significantly affected the student grades in each of the last three math exams. Finally, since the effect of these variables in the regression was weak, I suggest that researchers consider a larger population in future analyses and not focus the results on a single subject. If they want to consider only mathematics, they should include teachers’ attitudes toward mathematics in addition to the attitudes of students and parents. The weak correlation between parent involvement and student achievement might become stronger if teachers’ attitudes were considered. Researchers can conduct replication studies by expanding the population to additional grade levels, rather than a single grade level, to provide more in-depth results. Another idea for a replication study would be to examine predictors of overall student achievement rather than math achievement, or to conduct a study at the primary, middle, or high school level to examine predictors of math achievement.

#References